Table of contents

Data dictionary

Dataset 1 (df): 10 years of daily weather observations from different locations in Australia

Data: https://www.kaggle.com/jsphyg/weather-dataset-rattle-package

Data source: http://www.bom.gov.au/climate/dwo/ and http://www.bom.gov.au/climate/data

Definitions: http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml

Dataset 2 (df_2): Daily maximum temperature (degrees Celsius) at the Melbourne Airport weather station from 1970 to 2021

Station Details

Data source: http://www.bom.gov.au/jsp/ncc/cdio/weatherData/av?p_nccObsCode=122&p_display_type=dailyDataFile&p_startYear=&p_c=&p_stn_num=086282

Dataset 3 (latlong): Dataset containing 543 prominent cities in Australia. Each row includes a city's latitude, longitude, state and other variables of interest.

Data: https://simplemaps.com/data/au-cities

Data preprocessing dataset 1

Date

The Date column is filled correctly, with no missing values.

Location

The Location column is filled correctly, with no missing values.

Evaporation

Evaporation has many missing values (~43%), and most observed values are below 15 (~97%); choice: linear interpolation, then drop the rows that could not be interpolated.
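The interpolate-then-drop step can be sketched as follows; the values are a hypothetical stand-in for the Evaporation column, not the real data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Evaporation column (values are hypothetical).
df = pd.DataFrame({"Evaporation": [np.nan, 4.2, np.nan, 6.0]})

# Linear interpolation fills gaps between observed values; a leading NaN
# has no left neighbour, so it cannot be interpolated and stays NaN.
df["Evaporation"] = df["Evaporation"].interpolate(method="linear")

# Drop the rows that could not be interpolated.
df = df.dropna(subset=["Evaporation"])
```

The interior NaN becomes the midpoint of its neighbours (here 5.1), while the leading NaN is dropped.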

Sunshine

Sunshine has a more normal-looking distribution, but almost half (~48%) of its entries are missing; choice: drop it from the analysis.

Cloud9am and Cloud3pm

Cloud9am and Cloud3pm have bimodal distributions; Cloud9am is 37% NaN and Cloud3pm is 39% NaN. Choice: fill NaN entries by linear interpolation.

Rainfall

Rainfall is almost always <1. Choice: fill NaN entries with the mean.

WindGustSpeed

The mean is also used to fill the missing wind gust speeds (~7%), as their distribution looks approximately normal.

MinTemp, MaxTemp, Humidity9am, Humidity3pm, Temp9am, Temp3pm

These columns each had only about 1-3% NaN entries. Choice: drop rows with NaN, for an acceptable data loss (~4% of rows, from 139411 to 133810 entries).
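Row-wise dropping restricted to specific columns can be sketched like this; the frame is a toy example with two of the dataset's column names.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; column names follow the dataset's schema.
df = pd.DataFrame({
    "MinTemp": [10.0, np.nan, 12.0, 9.5],
    "MaxTemp": [21.0, 22.0, np.nan, 20.0],
})

# Drop any row with a NaN in the listed columns only,
# leaving rows that are complete in those columns.
df = df.dropna(subset=["MinTemp", "MaxTemp"])
```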

WindGustDir, WindDir9am, WindDir3pm

Wind directions are categorical and vary with other features (correlations can be checked), but only around ~7% of their entries are missing. Filling NaNs with the mode may not make much sense for these columns, so choice: drop rows with NaN. This loses another ~11% of rows, but 119k+ remain (from 133810 to 119409).

Pressure9am and Pressure3pm

The pressure distributions are approximately normal, so the mean is used to fill their NaN entries.
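Mean imputation for an approximately normal column can be sketched as below; the pressure values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for Pressure9am (hPa values are hypothetical).
df = pd.DataFrame({"Pressure9am": [1010.0, np.nan, 1014.0, np.nan]})

# Approximately normal distribution -> fill NaNs with the column mean.
df["Pressure9am"] = df["Pressure9am"].fillna(df["Pressure9am"].mean())
```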

RainToday and RainTomorrow

RainToday and RainTomorrow are 77% No and 22% Yes. Choice: since these involve the target variable, we simply drop the rows with missing values (~2%).

Data preprocessing dataset 2

Data preprocessing dataset 3

EDA and visualization

From the plot we can see a great imbalance between the two classes; observing rain is quite anomalous in large parts of Australia, after all. It is then incumbent on us to avoid metrics such as accuracy when evaluating the predictive models' performance, since a trivial deterministic predictor (classifying every datapoint as 0 with probability 1) would already achieve very high accuracy. An alternative is average precision, which summarizes the precision-recall curve and accounts for the rarity of the positive class. Simply put, precision is the ratio of true positives to total predicted positives: we want almost all our forecasted days of rain to be days where this anomaly was observed.
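The point about accuracy versus average precision can be illustrated with hypothetical labels and scores (1 = rain tomorrow, the rare class); the numbers below are made up for the example.

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Hypothetical labels and predicted scores; the positives are rare,
# so a trivial "never rain" predictor would still get 75% accuracy here.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 0])
y_score = np.array([0.10, 0.20, 0.15, 0.05, 0.90, 0.30, 0.70, 0.20])

# Average precision summarizes the precision-recall curve; ranking
# every positive above every negative yields the maximum of 1.0.
ap = average_precision_score(y_true, y_score)
```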

Question 1

Question 2

Hypothesis Testing: Is there a significant difference in frequency of precipitation between the four locations we just defined?

Given the small p-value, we reject the null hypothesis: the frequency of precipitation differs significantly between the four locations. We can investigate this further.
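A test of this kind can be sketched with a chi-square test of independence on a contingency table of rainy/dry day counts per location; the counts below are invented for illustration, not the report's actual numbers.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table: one row per location,
# columns = (# rainy days, # dry days).
table = np.array([
    [300, 700],
    [150, 850],
    [120, 880],
    [100, 900],
])

# Chi-square test of independence between location and rain frequency;
# a small p-value indicates the rain rates are not all equal.
chi2, p, dof, expected = chi2_contingency(table)
```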

Question 3

To simplify the analysis, we use the month (rather than the full day and year) as the time feature; computational resource constraints also prompt us to label locations as regions, rather than complicating the model further with individual city labels and both latitude and longitude.
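Reducing the full date to a month feature is a one-liner in pandas; the dates below are hypothetical.

```python
import pandas as pd

# Hypothetical Date values; .dt.month keeps only the month (1-12).
df = pd.DataFrame(
    {"Date": pd.to_datetime(["2010-01-05", "2010-07-20", "2011-12-01"])}
)
df["Month"] = df["Date"].dt.month
```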

One-hot encoding of categorical variables (RainToday is already encoded, since it is a binary variable)

Standardization (z-score) of numerical variables
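Both preprocessing steps can be sketched together on a toy frame; the column names match the dataset's schema, but the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame with one categorical and one numerical column.
df = pd.DataFrame({
    "WindGustDir": ["N", "S", "E", "N"],
    "MaxTemp": [20.0, 25.0, 30.0, 25.0],
})

# One-hot encode the categorical column (one indicator per category).
df = pd.get_dummies(df, columns=["WindGustDir"])

# z-score the numerical column: zero mean, unit variance.
df["MaxTemp"] = (df["MaxTemp"] - df["MaxTemp"].mean()) / df["MaxTemp"].std(ddof=0)
```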

We stratify the data with respect to the presence of rain tomorrow, since otherwise most of the splits would be dominated by no-rain days (these are in overwhelming abundance in the data; it doesn't rain much in Australia, fancy that).
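A stratified split can be sketched as follows on a synthetic imbalanced target (~25% positives, loosely mimicking RainTomorrow):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target: 75 negatives, 25 positives.
y = np.array([0] * 75 + [1] * 25)
X = np.arange(100).reshape(-1, 1)

# stratify=y preserves the 25% positive rate in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```

Without `stratify`, an unlucky split could leave the test set with almost no rainy days.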

Bar plot of top 20 feature importances
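A sketch of how the top importances can be extracted, using synthetic data in place of the preprocessed weather features; in the notebook, `top.plot.bar()` then draws the bar chart.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the preprocessed features.
X, y = make_classification(n_samples=200, n_features=10, random_state=42)
rf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)

# Impurity-based importances sum to 1; sort descending and keep the top k.
importances = pd.Series(rf.feature_importances_,
                        index=[f"f{i}" for i in range(10)])
top = importances.sort_values(ascending=False).head(5)
```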

We can visualize the performance of the architectures whose hyperparameters were tuned through GridSearchCV (logistic regression and random forest) in the simple plot below. For logistic regression, we see that the candidate parameter values we chose do not greatly impact the model's overall predictive performance.
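The tuning setup can be sketched as below; the data is synthetic and the grid over the regularization strength `C` is a small hypothetical one, not the report's actual candidate values.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data in place of the preprocessed weather features.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Exhaustive search over a small hypothetical grid, scored by
# average precision with (stratified) 5-fold CV.
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},
    scoring="average_precision",
    cv=5,
).fit(X, y)
```

`grid.cv_results_` holds the per-fold scores that the plots below are built from.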

We also see how our CV results shaped up across folds. For logistic regression, all parameter combinations perform generally "the same"; the quotes are deliberate, as statistical tests (such as a corrected t-test) would be needed to ascertain whether two models are statistically different. Note that the performance of the model depends strongly on the fold.

Since all models are evaluated on the same k partitions without re-randomizing the data (as opposed to setting the cv parameter in GridSearchCV to, say, RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=42)), we expect covariance between the models' fold scores, which is all the more apparent given that the model was not sensitive to the candidate parameter combinations. We decided against repeated stratified K-fold with a different randomization in each repetition because the authors did not have machines at home capable of handling such computations.

However, only further testing (such as a corrected paired t-test) would allow us to ascertain whether these models are statistically different.
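Such a test can be sketched with the Nadeau-Bengio corrected resampled t-test, which adjusts the variance of per-fold score differences for the overlap between CV training sets; the per-fold scores below are hypothetical.

```python
import numpy as np
from scipy import stats

def corrected_paired_ttest(scores_a, scores_b, n_train, n_test):
    """Nadeau-Bengio corrected resampled t-test on per-fold score differences.

    The n_test/n_train term inflates the variance to account for the
    overlap between training sets across CV folds.
    """
    d = np.asarray(scores_a) - np.asarray(scores_b)
    k = len(d)
    t = d.mean() / np.sqrt(d.var(ddof=1) * (1.0 / k + n_test / n_train))
    p = 2 * stats.t.sf(abs(t), df=k - 1)
    return t, p

# Hypothetical per-fold AUC scores for two models from 5-fold CV
# (each fold trains on 80 samples and tests on 20).
t_stat, p_value = corrected_paired_ttest(
    [0.80, 0.82, 0.81, 0.79, 0.80],
    [0.78, 0.81, 0.79, 0.76, 0.80],
    n_train=80, n_test=20,
)
```

With these made-up scores the difference is not significant at the 5% level, illustrating how a visible gap in mean scores can still fail the test.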

As for the random forest, we observe that different parameter combinations led to a greater variety of mean test scores. We would have to determine whether these models are statistically different with further testing.

We observe that the random forest models with the entropy criterion generally performed better than those with the Gini criterion. A weak argument could be made that, for a given criterion, increasing the number of trees in the forest and the maximum depth of a tree yields more performant models under AUC.